ABSTRACT

Motivation: The recent outbreak of severe acute respiratory
syndrome (SARS), caused by the SARS coronavirus (SARS-CoV),
has necessitated an in-depth molecular understanding of the
virus to identify new drug targets. The availability of
complete genome sequences of several strains of the SARS
virus makes it possible to identify protein-coding genes and
define their functions. A computational approach to
identifying protein-coding genes and their putative functions
will help in designing experimental protocols.

Results: In this paper, we present a novel analysis of the
SARS genome using GeneDecipher, a gene prediction method
developed in our laboratory. Each of the 18 newly sequenced
SARS-CoV genomes has been analyzed using GeneDecipher. In
addition to polyprotein 1ab (see footnote 1), polyprotein 1a
and the four genes coding for the major structural proteins
spike (S), small envelope (E), membrane (M) and nucleocapsid
(N), six to eight additional proteins have been predicted,
depending upon the strain analyzed. Their lengths range
between 61 and 274 amino acids. Our method also suggests
that polyprotein 1ab, polyprotein 1a, S, M and N are proteins
of viral origin and that the others are of prokaryotic origin.
Putative functions of all predicted protein-coding genes have
been suggested using conserved peptides present in their open
reading frames.

Availability: Detailed results of GeneDecipher analysis of 
all the 18 strains of SARS-CoV genomes are available at 
http://www.igib.res.in/sarsanalysis.html 
Contact: skb@igib.res.in 
Footnote 1: GeneDecipher predicts polyprotein 1ab (265..21485) in two
fragments, (265..13413) and (13599..21485), because there is a stop codon
at location 13413. These locations are given with respect to the NCBI
RefSeq genome sequence.


INTRODUCTION 

Severe acute respiratory syndrome (SARS) has emerged as
a life-threatening disease. Early reports of SARS appeared
from China (Ksiazek et al., 2003), and cases were
subsequently reported from Taiwan, Vietnam, Canada,
Singapore and other countries. The symptoms observed in
SARS-affected patients include fever, dry cough, dyspnea,
headache and hypoxemia. Typical laboratory findings include
lymphopenia and mildly elevated aminotransferase levels.
Death may result from progressive respiratory failure due to
alveolar damage (Tsang et al., 2003). On average, the
mortality rate was 4%, though it varied widely according to
the geographic location (WHO Report, 2003,
http://www.who.int/csr/sarscountry/2003_04_04/en/)
and with the strain implicated. SARS isolates from different
parts of the world have been sequenced recently. Sequence
analysis of nucleic acid fragments isolated from cytopathic
Vero cell cultures showed that the encoded protein sequences
were similar to proteins of other coronaviruses (Drosten
et al., 2003, www.nejm.org). However, at the nucleic acid
level, no similarity was observed with any sequence in the
database, indicating substantial diversity. Phylogenetic
analysis showed that the isolated sequence is distinct and is
placed between group 2 and group 3 coronaviruses in the tree
(Marra et al., 2003).

Current computational methods such as GeneMark.hmm
(Lukashin and Borodovsky, 1998) and Glimmer (Salzberg et al.,
1998) face difficulty in analyzing the SARS genome because of
its small size. Methods based on hidden Markov models (HMMs)
require thousands of parameters for training, which makes
them less suitable for analyzing smaller genomes. The problem
is compounded in the case of SARS-CoV genomes, which are
about 30 kb in length. Even ZCURVE_CoV (Chen et al., 2003),
the method most suitable for viral gene prediction to date,
needs 33 parameters for training.

GeneDecipher, originally developed for prokaryotic gene
prediction, needs only five parameters and can therefore
analyze smaller genomes too. We have trained the artificial
neural network (ANN) on coding and non-coding regions [open
reading frames (ORFs) not reported as genes] of the
E. coli K-12 genome. No additional training is required to
predict protein-coding genes on viral genomes using
GeneDecipher, which is an obvious advantage of this method
over others. In addition, it is very difficult to find a
negative training set (non-coding regions) for small genomes
like those of coronaviruses. In such cases, non-coding
sequences for training are made by shuffling the coding
sequences (Chen et al., 2003). Obviating the need to train
specifically for the organism thus makes GeneDecipher
suitable for such small genomes.
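
The shuffling approach mentioned above can be illustrated with a
minimal sketch. The text does not say how Chen et al. shuffle the
sequences, so the single-nucleotide shuffle, the function name and the
seed parameter below are assumptions made purely for illustration.

```python
import random
from typing import Optional


def shuffled_negative(coding_seq: str, seed: Optional[int] = None) -> str:
    """Return a pseudo non-coding sequence by shuffling the nucleotides.

    Assumption: a plain mononucleotide shuffle, which preserves base
    composition but destroys codon order. The scheme actually used for
    ZCURVE_CoV may differ.
    """
    rng = random.Random(seed)
    bases = list(coding_seq)
    rng.shuffle(bases)
    return "".join(bases)


# Example with the sequence fragment shown in Fig. 1.
coding = "ATGCCTAAGTACCGTTCCGCCACCACCACT"
negative = shuffled_negative(coding, seed=1)
print(negative)
```

A shuffle of this kind keeps sequence length and base composition, so a
classifier trained on it has to rely on codon-level structure rather than
on composition alone.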

We then tried to assign functions to the SARS-CoV genes
predicted by GeneDecipher using the peptide library-based
homology search tool (PLHOST), a tool for functional
prediction developed in our laboratory. PLHOST assigns
function based upon the presence of invariant octa- and
heptapeptides across proteins from different species. In this
paper, we present the results of our analysis of 18 SARS-CoV
genomes.

METHODS 

SARS-CoV genome sequence 

Sequences of the 18 SARS-CoV strains available in the
GenBank database
(http://www.ncbi.nlm.nih.gov/Entrez/genomes/viruses) were
downloaded and analyzed. These include SARS-CoV Refseq
(NC_004718.3), SARS-CoV TWC (AY32118), SIN2774 (AY283798),
SIN2748 (AY283797), SIN2679 (AY283796), SIN2677 (AY283794),
SIN2500 (AY283794), Frankfurt 1 (AY291315), BJ04 (AY279354),
BJ03 (AY278490), BJ02 (AY278487), GZ01 (AY278848),
CUHKW1 (AY278554), TOR2 (AY274119), TW1 (AY291451),
BJ01 (AY278488), Urbani (AY278741) and HKU-39849 (AY278491).
Other information related to protein-coding genes was
retrieved from http://www.ncbi.nlm.nih.gov/genomes/SARS/SARS.html

GeneDecipher: Protein-coding gene prediction 
software 

Originally, GeneDecipher was developed for prokaryotic gene 
prediction. To execute GeneDecipher on viral genomes we 
prepared a heptapeptide library derived from the proteins of 
56 completely sequenced prokaryotic genomes and 1096 viral 
genomes. 

The development of GeneDecipher is based upon the observation
that the difference between the total number of theoretically
possible peptides of a given length and the number actually
observed in nature grows drastically as the peptide length
increases. Moreover, most of the peptides selected by nature
are found only in coding regions and very rarely in
theoretically translated non-coding regions. This observation
prompted us to exploit the exclusivity of naturally selected
peptides in protein-coding sequences to differentiate between
coding and non-coding regions.
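
The scale of this gap can be illustrated with simple arithmetic. The
sketch below is not part of GeneDecipher; the function names and the
residue count are illustrative assumptions. It compares the number of
theoretically possible peptides of length k over the 20 standard amino
acids (20^k) with the largest number of distinct k-mers that a protein
collection of a given total size could contribute.

```python
def possible_peptides(k: int, alphabet_size: int = 20) -> int:
    """Number of theoretically possible peptides of length k."""
    return alphabet_size ** k


def max_observed_fraction(k: int, total_residues: int) -> float:
    """Upper bound on the fraction of possible k-mers that can be observed.

    A collection with `total_residues` amino acids contains at most
    total_residues - k + 1 overlapping k-mers, so it cannot cover more
    than that many distinct peptides.
    """
    upper = max(total_residues - k + 1, 0)
    return min(1.0, upper / possible_peptides(k))


# Hypothetical library size of ten million residues (not a figure from the paper).
TOTAL_RESIDUES = 10_000_000
for k in range(3, 9):
    print(k, possible_peptides(k), max_observed_fraction(k, TOTAL_RESIDUES))
```

For heptapeptides the theoretical space is 20^7 = 1.28 x 10^9, so even a
library of this hypothetical size could cover less than 1% of it, which
is why the presence of a known heptapeptide is informative.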

Prediction of a given ORF as a coding region/gene is based
upon the number of heptapeptides present and the distribution
of these heptapeptides along the ORF. The output for a given
ORF is a probability value (the probability of this ORF being
a gene). The final cut-off probability is chosen by the user,
but it is constant for a given genome in all six reading
frames (the default cut-off is 0.5).

It is worth noting that, in order to test the strength of the
hypothesis, our method does not depend on any other evidence,
such as ribosome binding site signals. Constraints of this
kind are used by various existing methods.

The method can be divided into five major steps (Fig. 1): 

(1) Generation of a peptide library. 

(2) Artificial translation of a given genome into six reading 
frames. 

(3) Conversion of each translated sequence into an integer- 
coded sequence (one corresponding to each reading 
frame). 

(4) Training of ANN. 

(5) Deciphering genes using trained ANN. 
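
Steps (1) to (3) can be sketched as follows. This is a minimal
illustration under stated assumptions, not the GeneDecipher
implementation: all function names are hypothetical, the standard
genetic code is assumed, stop codons are written as '*' in the
translated protein, and a window that contains a stop codon is coded
'*' even when it begins at a start codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
START_CODONS = {"ATG", "GTG", "TTG"}  # start codons listed in Fig. 1
K = 7                                 # heptapeptides


def build_library(proteomes):
    """Step 1: map each heptapeptide to the number of organisms containing it.

    `proteomes` maps an organism name to a list of protein sequences.
    """
    library = {}
    for proteins in proteomes.values():
        seen = {p[i:i + K] for p in proteins for i in range(len(p) - K + 1)}
        for pep in seen:
            library[pep] = library.get(pep, 0) + 1
    return library


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def six_frames(dna):
    """Step 2: codon lists for the three forward and three reverse frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [[s[i:i + 3] for i in range(off, len(s) - 2, 3)]
            for s in strands for off in range(3)]


def integer_code(frame_codons, library):
    """Step 3: code every overlapping heptapeptide of one frame as one character."""
    protein = "".join(CODON_TABLE.get(c, "X") for c in frame_codons)
    coded = []
    for i in range(len(protein) - K + 1):
        window = protein[i:i + K]
        if "*" in window:                      # window spans a stop codon
            coded.append("*")
        elif frame_codons[i] in START_CODONS:  # window begins at a start codon
            coded.append("s")
        else:                                  # occurrence value, capped at 9
            coded.append(str(min(library.get(window, 0), 9)))
    return "".join(coded)


# Toy example: the fragment from Fig. 1 and a two-entry library.
frames = six_frames("ATGCCTAAGTACCGTTCCGCCACCACCACT")
toy_library = {"PKYRSAT": 1, "KYRSATT": 1}
print(integer_code(frames[0], toy_library))
```

With this toy library the first forward frame (MPKYRSATTT) is coded as
`s110`: one start, two heptapeptides found in one organism each, and one
heptapeptide absent from the library. Because a single stop codon falls
inside seven consecutive windows, it produces the run of seven '*' that
marks the end of a gene in Fig. 1.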

PLHOST: Function assignment tool 

We used PLHOST to identify invariant peptides, which serve as
functional signatures, from completely sequenced genomes
(Brahmachari and Dash, 2001).

The algorithm generates organism-specific libraries of octa-
and heptapeptides from all proteins of the selected genomes.
Redundant peptides are removed from each library. These
peptide libraries are then compared with each other to note
all octa- and heptapeptides present invariantly across a
specified minimum number of genomes. Overlapping octa- and
heptapeptides are backstitched to generate longer conserved
peptides. These occur in functionally similar proteins and
are hence called functional signatures.
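
The library comparison and backstitching described above can be sketched
as follows. This is a hedged illustration rather than the PLHOST code:
the function names, the default of two genomes and the toy protein
sequences are assumptions; only the signature LQGPPGTGK is taken from
Table 6.

```python
def peptide_library(proteins, k=8):
    """Non-redundant set of k-mers from the proteins of one organism."""
    return {p[i:i + k] for p in proteins for i in range(len(p) - k + 1)}


def invariant_peptides(proteomes, k=8, min_genomes=2):
    """k-mers present in at least `min_genomes` organism-specific libraries."""
    counts = {}
    for proteins in proteomes.values():
        for pep in peptide_library(proteins, k):
            counts[pep] = counts.get(pep, 0) + 1
    return {pep for pep, n in counts.items() if n >= min_genomes}


def backstitch(protein, peptides, k=8):
    """Merge overlapping invariant k-mers found in `protein` into signatures."""
    signatures = []
    start = end = None
    for i in range(len(protein) - k + 1):
        if protein[i:i + k] not in peptides:
            continue
        if start is not None and i < end:   # overlaps the current stretch
            end = i + k
        else:                               # begins a new stretch
            if start is not None:
                signatures.append(protein[start:end])
            start, end = i, i + k
    if start is not None:
        signatures.append(protein[start:end])
    return signatures


# Hypothetical two-organism example; the flanking residues are invented.
proteomes = {
    "organism_1": ["AALQGPPGTGKSS"],
    "organism_2": ["MMLQGPPGTGKTT"],
}
shared = invariant_peptides(proteomes, k=8, min_genomes=2)
print(backstitch("AALQGPPGTGKSS", shared))
```

In this toy case the two shared octapeptides LQGPPGTG and QGPPGTGK
overlap by seven residues and are stitched into the nine-residue
signature LQGPPGTGK.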

RESULTS AND DISCUSSION 

A systematic sensitivity and specificity analysis of
GeneDecipher has been carried out on 10 microbial genomes
(Fig. 2). Further analysis of GeneDecipher on viral genomes
is presented here.

Testing of GeneDecipher on viral genomes

To test our method on viral genomes, we first analyzed the
complete genome of human respiratory syncytial virus (HRSV)
using GeneDecipher and compared the results with those of
the state-of-the-art method ZCURVE_CoV (Table 1).
ZCURVE_CoV is able to predict 8 of the 11 annotated proteins
reported at NCBI without any false positives. ZCURVE_CoV was
unable to predict
Fig. 1. GeneDecipher flow diagram. The content of the diagram is
summarized below.

Nucleotide string:
  ...ATGCCTAAGTACCGTTCCGCCACCACCACT...CACCGGAATGACCGACGCCGATTTCGGTAA...

Translate in all 6 frames, giving hypothetical proteins in 6 frames
(the 3 forward frames are shown):
  mPKYRSATTT...HRNDRRRFRz
  CLSTVPPPPL...TGmTDADFG
  AzVPFRHHH....PEzPTPISV

Search each overlapping heptapeptide in the library and report the
occurrence profile, giving 6 integer-coded strings. Peptides starting
with 'm' are replaced by 's' and those containing 'z' are replaced by
'*'. The integer represents the number of organisms in which the
heptapeptide is present in the library. An occurrence value of more
than 9 is treated as 9.

Split the integer strings into fragments, with start ('s') coded by
ATG, GTG, TTG and stop codon ('*') coded by TAA, TAG and TGA. Seven
consecutive '*' in the integer-coded sequence denote the end of a gene.
This gives all possible coding regions (ORFs).

The ANN trained on the E. coli K-12 genome then gives the predicted
protein-coding regions.

Peptide library format:

Heptapeptide  Occurrence value
AAAALMH       2
AAAAAAC       5
ADAAAAA       6
KYRSATT       1
LLGGRKV       4
NGGDTRS       7
PKYRSAT       1

Fig. 2. Sensitivity and specificity of GeneDecipher.
Table 1. Comparison of GeneDecipher results with ZCURVE_CoV results
on HRSV genome, with respect to annotated genes

Annotated genes       ZCURVE_CoV            GeneDecipher
Start  End    Length  Start  End    Length  Start  End    Length
99     518    139     99     518    139     99     518    139
626    1000   124     -      -      -       626    1000   124
1140   2315   391     1140   2315   391     1140   2315   391
2348   3073   241     2348   3073   241     2348   3073   241
3263   4033   256     3158   4033   291     3158   4033   291
4303   4500   65      4303   4500   65      4303   4500   65
4690   5589   299     -      -      -       4690   5589   299
5666   7390   574     5666   7390   574     5621   7390   589
7618   8205   195     7618   8205   195     7618   8205   195
8171   8443   90      -      -      -       -      -      -
8509   15009  2166    8443   15009  2188    8443   15009  2188


the following three genes: PID 9629200 (location 626..1000, 
non-structural protein 2 (NS2)); PID 9629205 (location 
4690..5589, attachment glycoprotein (G)) and PID 9629208 
(location 8171..8443, matrix protein 2 (M2)). GeneDecipher 
predicted 10 out of total 11 annotated proteins of HRSV 
without any false positives. The gene missed by GeneDecipher 
was PID 9629208 (location 8171..8443, matrix protein 2), 
which was notably missed by ZCURVE_CoV too. 
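
The counts quoted above can be reproduced from the coordinates in
Table 1. The sketch below is illustrative only: it treats a prediction
as matching an annotated gene when both share the same end coordinate,
an assumption made here because some predicted start sites differ from
the annotation while the text still counts those genes as predicted. The
variable and function names are hypothetical.

```python
# (start, end) pairs transcribed from Table 1.
ANNOTATED = [(99, 518), (626, 1000), (1140, 2315), (2348, 3073),
             (3263, 4033), (4303, 4500), (4690, 5589), (5666, 7390),
             (7618, 8205), (8171, 8443), (8509, 15009)]
ZCURVE_COV = [(99, 518), (1140, 2315), (2348, 3073), (3158, 4033),
              (4303, 4500), (5666, 7390), (7618, 8205), (8443, 15009)]
GENEDECIPHER = [(99, 518), (626, 1000), (1140, 2315), (2348, 3073),
                (3158, 4033), (4303, 4500), (4690, 5589), (5621, 7390),
                (7618, 8205), (8443, 15009)]


def score(annotated, predicted):
    """Return (matched genes, false positives, sensitivity).

    Assumption: genes are matched by end coordinate only.
    """
    annotated_ends = {end for _, end in annotated}
    predicted_ends = {end for _, end in predicted}
    matched = len(annotated_ends & predicted_ends)
    false_positives = len(predicted_ends - annotated_ends)
    return matched, false_positives, matched / len(annotated_ends)


for name, calls in (("ZCURVE_CoV", ZCURVE_COV), ("GeneDecipher", GENEDECIPHER)):
    matched, fp, sens = score(ANNOTATED, calls)
    print(f"{name}: {matched}/11 genes, {fp} false positives, sensitivity {sens:.2f}")
```

Under this matching rule ZCURVE_CoV recovers 8 of the 11 annotated genes
and GeneDecipher 10 of 11, with no false positives for either method.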

This successful prediction of protein-coding regions in the
HRSV genome increases our confidence in predicting
protein-coding regions in newly sequenced SARS-CoV genomes.

Analysis of SARS-CoV using GeneDecipher 

We analyzed all 18 strains of SARS-CoV using GeneDecipher
(detailed results are available on the website given above).
GeneDecipher predicts a total of 15 protein-coding regions
in SARS-CoV genomes, including both polyproteins 1a and 1ab
(Sars2628, the C-terminal end of polyprotein 1ab) and all
four known structural proteins (M, N, S and E) for each of
the 18 strains. GeneDecipher also predicts six to eight
additional coding regions, depending on the genome sequence
of the strain used. The length of these additional coding
regions varies between 61 and 274 amino acids.

GeneDecipher predicts 12 coding regions that are common to
all 18 strains (Table 2), and one coding region (Sars63;
Sars6 in the NCBI RefSeq genome) that is present in five
strains. GeneDecipher predicts gene Sars90 specifically in
the GZ01 strain and Sars154 (Sars3b in the NCBI RefSeq
genome) specifically in the BJ02 strain.

These 12 common protein-coding regions consist of the six
basic proteins of SARS-CoV (two polyproteins and the four
structural proteins); Sars274 (Sars3a in the NCBI RefSeq
database), Sars122 (Sars7a in the NCBI RefSeq database) and
Sars78 (already reported with a shifted start as ORF14/Sars9c
in the TOR2 strain); and three newly predicted (false
positives with respect to current annotation at NCBI)
protein-coding


Table 2. Protein-coding genes predicted by GeneDecipher in SARS-CoV
Refseq common to all 18 strains

S. no.  Start  Stop   Frame  Length (bp)  Length (aa)  Feature
1       265    13413  1+     13149        4382         Sars 1a polyprotein
2       701    1225   2+     525          174          Sars174 (new prediction)
3       1397   1603   2+     207          68           Sars68 (new prediction)
4       8828   9013   2+     186          61           Sars61 (new prediction)
5       13599  21485  3+     7887         2628         Sars2628 (C-terminal end of polyprotein 1ab)
6       21492  25259  3+     3768         1255         Spike (S) protein
7       25268  26092  2+     825          274          Sars274 (Sars3a)
8       26117  26347  2+     231          76           Sars76 (Sars4)
9       26398  27063  1+     666          221          Sars221 (Sars5)
10      27273  27641  3+     369          122          Sars122 (Sars7a)
11      28120  29388  1+     1269         422          Sars422 (Sars9a)
12      28559  28795  2+     237          78           Sars78 (identical to ORF14/Sars9c in TOR2 with shifted start)



regions Sars174, Sars68 and Sars61. The three newly predicted
genes lie completely within the polyprotein 1a genomic
region. Although our method discards such genes in bacterial
genomes, the possibility of finding such genes in viral
genomes cannot be ruled out. As these genes are present in
all 18 strains, it is likely that they are protein-coding genes.
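
The length and frame columns of Table 2 follow from the start and stop
coordinates by simple arithmetic. The helper below is a sketch with a
hypothetical function name. It assumes, consistently with the values in
the table, that the stop coordinate includes the stop codon and that
frames are numbered 1 to 3 on the forward strand from genome position 1.

```python
def orf_summary(start: int, stop: int):
    """Return (length in bp, length in aa, frame label) for a forward-strand ORF."""
    length_bp = stop - start + 1
    if length_bp % 3:
        raise ValueError("ORF length is not a multiple of three")
    length_aa = length_bp // 3 - 1   # the stop codon is not translated
    frame = (start - 1) % 3 + 1      # 1-based forward frame
    return length_bp, length_aa, f"{frame}+"


# Rows 1, 2 and 5 of Table 2.
for start, stop in ((265, 13413), (701, 1225), (13599, 21485)):
    print(start, stop, orf_summary(start, stop))
```

For example, the region 701..1225 spans 525 bp, that is 175 codons
including the stop, giving the 174 amino acids of Sars174 in frame 2+.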

We predict three more coding regions, Sars63, Sars154 and
Sars90, apart from the 12 discussed above. Sars63 is
identified in five strains and not in the remaining 13. This
coding region is already reported in NCBI RefSeq (Sars6). We
cannot comment much on the existence of Sars63 because of
this inconsistency, which is due to a high density of
non-synonymous mutations across strains in this region. The
two coding regions Sars154 (Sars3b at NCBI) and Sars90 (newly
predicted in the GZ01 strain) are each identified in only one
strain. They are therefore less likely to be protein-coding
regions, as also suggested by the ZCURVE_CoV (Chen et al.,
2003) analysis. The locations of these three genes in the
different strains are provided in Table 3.

Since the peptide libraries are made from the genome
sequences of various organisms, the evolutionary origin of
a given protein can be traced. If a protein is rich in
heptapeptides found in viral genomes, it is considered to be
of viral origin. We found that five core proteins (the two
polyproteins and the three structural proteins M, N and S)
are of viral origin. The remaining proteins, including the
three new predictions, are of prokaryotic origin. It is
interesting to note that the same DNA region yields proteins
in different frames that contain peptides of different
origins. How the same DNA sequence can code for proteins of
both bacterial and viral origin is intriguing. This might
explain why these new
Table 3. Identification of Sars90, Sars63, Sars154 as protein-coding genes
by GeneDecipher in various strains of SARS-CoV

S. no.  Strain name  Sars90 (new prediction)  Sars63 (Sars6 at NCBI)  Sars154 (Sars3b at NCBI)
1       SIN2748      -                        -                       -
2       BJ01         -                        27055..27246            -
3       BJ02         -                        27074..27265            25689..26153
4       BJ03         -                        27070..27261            -
5       BJ04         -                        27058..27249            -
6       Frankfurt 1  -                        -                       -
7       Urbani       -                        -                       -
8       GZ01         24492..24764             27058..27249            -
9       SIN2500      -                        -                       -
10      SIN2677      -                        -                       -
11      SIN2679      -                        -                       -
12      SIN2774      -                        -                       -
13      CUHKW1       -                        -                       -
14      TW1          -                        -                       -
15      TWC          -                        -                       -
16      HKU-39849    -                        -                       -
17      Refseq       -                        -                       -
18      TOR2         -                        -                       -


protein-coding genes were not detected in primary attempts 
based on homology to other known viral genome sequences. 

Comparison with the existing system: ZCURVE_CoV

Comparisons of GeneDecipher and ZCURVE_CoV results with
the known annotations for the Urbani and TOR2 strains of
SARS-CoV are presented in Tables 4 and 5.

In general, GeneDecipher results are in good agreement
with the known annotations. For the Urbani strain,
GeneDecipher predicts all the known genes except Sars84 (X5),
Sars63 (X3) and Sars154 (X2). Sars84 (X5) and Sars63 (X3)
are supported by ZCURVE_CoV, whereas Sars154 (X2) is
missed by both methods. GeneDecipher predicts four new genes
in this strain, none of which is supported by ZCURVE_CoV.
Notably, one of these four genes, Sars78, is already known
in strain TOR2 as ORF14/Sars9c, which supports the likelihood
of the gene being present in the Urbani strain. ZCURVE_CoV,
in turn, predicts two new genes that are not supported by
GeneDecipher.

GeneDecipher predictions for the TOR2 strain are identical
to those for the Urbani strain. In this strain, GeneDecipher
predicts nine known genes but fails to predict six genes with
known annotations: Sars154 (ORF4), Sars98 (ORF13), Sars63
(ORF7), Sars44 (ORF9), Sars39 (ORF10) and Sars84 (ORF11).
Of these, Sars154 (ORF4) and Sars98 (ORF13) are also missed
by ZCURVE_CoV. Both Sars44 (ORF9) and Sars39 (ORF10) are
very short ORFs (44 and 39 amino acids, respectively), and
their presence is not consistent across the various SARS
strains. Sars63 (ORF7) has been predicted by GeneDecipher in
five other strains but not in the two strains considered here.

Mutation analysis 

Analysis using multiple sequence alignment (ClustalW) of the
three newly predicted protein-coding genes Sars174, Sars68
and Sars61 across all 18 strains shows the following:

1. Sars68 has one point mutation at location 80, GAT → GGT
(D → G), in the SIN2677 strain.

2. Sars174 has two synonymous point mutations: at location
204, CGA → CGC, in the GZ01 strain and at location 447,
CTG → CTT, in the BJ04 strain.

3. Sars61 has one point mutation at location 119, CTG → CAG
(L → Q), in the GZ01 strain.

These three newly predicted genes are present in all 18 strains
without significant mutations and have no significant hits with
BLASTP in the non-redundant database. This indicates that these
three proteins might have crucial biological functions specific
to SARS-CoV. Therefore, these coding sequences might serve
as candidate drug targets against SARS.
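
The amino acid consequences of the codon changes listed above can be
checked against the standard genetic code. The sketch below is
illustrative; the function name and the tuple layout are assumptions,
and the codon pairs are those reported in the mutation list.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def classify(ref_codon: str, alt_codon: str):
    """Return (reference residue, mutant residue, mutation type)."""
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    kind = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return ref_aa, alt_aa, kind


# (gene, strain, reference codon, mutant codon) as reported in the text.
MUTATIONS = [
    ("Sars68", "SIN2677", "GAT", "GGT"),
    ("Sars174", "GZ01", "CGA", "CGC"),
    ("Sars174", "BJ04", "CTG", "CTT"),
    ("Sars61", "GZ01", "CTG", "CAG"),
]
for gene, strain, ref, alt in MUTATIONS:
    print(gene, strain, ref, alt, classify(ref, alt))
```

Under the standard code GAT → GGT changes aspartate to glycine, both
Sars174 changes are silent (arginine and leucine are retained), and
CTG → CAG changes leucine to glutamine.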

Function assignment 

In total, we predict 15 coding regions in SARS-CoV, of which
the functions of the four structural proteins (M, N, S and E)
have already been assigned. Although polyprotein 1ab has
been assigned only replicase activity, our analysis implies
that the replicase activity is associated with the Sars2628
(C-terminal of ORF 1ab) fragment. The complete 1ab
polyprotein contains six functional signatures, of which
those in polyprotein 1a are associated with metabolic enzymes
(Table 6). Functions were assigned to the polyproteins on the
basis of peptides (seven or more amino acids in length)
occurring in proteins with similar functions in at least five
different organisms. Other predicted genes/protein-coding
regions contain peptides that occur in fewer genomes. Based
on these peptides we suggest functions, albeit with lesser
confidence (Table 7). The biological relevance of these
findings remains to be explored.

CONCLUSION 

In this paper, we have predicted four new genes in SARS-CoV,
including Sars78 (already known in the TOR2 strain). Our
analysis also corroborates the finding of ZCURVE_CoV
(Chen et al., 2003) that ORF Sars154 (listed in RefSeq as
Sars3b) is unlikely to be a coding region. We have also
assigned functions to the two polyproteins 1ab and 1a. In
addition to the replication-associated function of the
C-terminal of polyprotein 1ab, our analysis implies that
polyprotein 1a may be associated with metabolic enzyme-like
functions. In all, six peptide signatures are present in
polyprotein 1ab. We have
Table 4. Comparison of GeneDecipher results with ZCURVE_CoV results on SARS-CoV genome Urbani strain, with respect to annotated genes

Annotated genes       ZCURVE_CoV            GeneDecipher          Features
Start  End    Length  Start  End    Length  Start  End    Length
265    13398  4377    265    13398  4377    265    13413  4382    ORF 1a
-      -      -       -      -      -       701    1225   174     Sars174 (new prediction by GeneDecipher)
-      -      -       -      -      -       1397   1603   68      Sars68 (new prediction by GeneDecipher)
-      -      -       -      -      -       8828   9013   61      Sars61 (new prediction by GeneDecipher)
13398  21485  2695    13398  21485  2695    13599  21485  2628    ORF 1b
21492  25259  1255    21492  25259  1255    21492  25259  1255    S protein
25268  26092  274     25268  26092  274     25268  26092  274     Sars274 (X1)
25689  26153  154     -      -      -       -      -      -       Sars154 (X2)
26117  26347  76      26117  26347  76      26117  26347  76      E protein
26398  27063  221     26398  27063  221     26389  27063  224     M protein
27074  27265  63      27074  27265  63      -      -      -       Sars63 (X3)
27273  27641  122     27273  27641  122     27273  27641  122     Sars122 (X4)
-      -      -       27638  27772  44      -      -      -       Sars44
-      -      -       27779  27898  39      -      -      -       Sars39
27864  28118  84      27864  28118  84      -      -      -       Sars84 (X5)
28120  29388  422     28120  29388  422     28120  29388  422     N protein
-      -      -       -      -      -       28559  28795  78      Sars78 (identical to ORF14/Sars9c in TOR2 with shifted start)


Table 5. Comparison of GeneDecipher results with ZCURVE_CoV results on SARS-CoV genome TOR2 strain, with respect to annotated genes

Annotated genes       ZCURVE_CoV            GeneDecipher          Features
Start  End    Length  Start  End    Length  Start  End    Length
265    13398  4377    265    13398  4377    265    13413  4382    ORF 1a
-      -      -       -      -      -       701    1225   174     Sars174 (new prediction by GeneDecipher)
-      -      -       -      -      -       1397   1603   68      Sars68 (new prediction by GeneDecipher)
-      -      -       -      -      -       8828   9013   61      Sars61 (new prediction by GeneDecipher)
13398  21485  2695    13398  21485  2695    13599  21485  2628    ORF 1b
21492  25259  1255    21492  25259  1255    21492  25259  1255    S protein
25268  26092  274     25268  26092  274     25268  26092  274     ORF3 (Sars274)
25689  26153  154     -      -      -       -      -      -       ORF4 (Sars154)
26117  26347  76      26117  26347  76      26117  26347  76      E protein
26398  27063  221     26398  27063  221     26389  27063  224     M protein
27074  27265  63      27074  27265  63      -      -      -       Sars63 (ORF7)
27273  27641  122     27273  27641  122     27273  27641  122     Sars122 (ORF8)
27638  27772  44      27638  27772  44      -      -      -       Sars44 (ORF9)
27779  27898  39      27779  27898  39      -      -      -       Sars39 (ORF10)
27864  28118  84      27864  28118  84      -      -      -       Sars84 (ORF11)
28120  29388  422     28120  29388  422     28120  29388  422     N protein
28130  28426  98      -      -      -       -      -      -       ORF13
28583  28795  70      -      -      -       28559  28795  78      Sars78 (identical to ORF14/Sars9c in TOR2 with shifted start)
Table 6. Functional assignment of polyproteins in SARS (Urbani) genome using PLHOST

S. no.  NCBI annotation                     Conserved peptide signature  Function assigned
1       Sars1ab (polyprotein 1ab)           RIRASLPT                     Phosphoglycerate kinase
                                            RSETLLPL                     Sulfite reductase (NADPH), flavoprotein beta subunit
                                            LDKLKSLL                     Probable acyl-CoA thiolase
                                            ATVVIGTS                     Cell division protein ftsZ
                                            NVAITRAK                     DNA-binding protein, probably DNA helicase
                                            LQGPPGTGK                    DNA helicase related protein
2       Sars1a (polyprotein 1a)             RIRASLPT                     Phosphoglycerate kinase
                                            RSETLLPL                     Sulfite reductase (NADPH), flavoprotein beta subunit
                                            LDKLKSLL                     Probable acyl-CoA thiolase
3       Sars2628 (C-terminal of Sars1ab)    ATVVIGTS                     Cell division protein ftsZ
                                            NVAITRAK                     DNA-binding protein, probably DNA helicase
                                            LQGPPGTGK                    DNA helicase related protein


Table 7. Suggested functions for some of the non-structural genes in SARS-CoV
using PLHOST

S. no.  Gene                                         Peptide signature  Suggested function
1       Sars174 (new prediction)                     TLSKGNAQ           ABC transporter ATP binding protein (Lactococcus lactis subsp. lactis)
                                                     VAQMGTLL           Cytochrome c oxidase folding protein (Synechocystis sp. PCC6803)
2       Sars68 (new prediction)                      LVLVLILA           Putative major facilitator superfamily protein (Schizosaccharomyces pombe)
                                                     TQTLKLDS           Serine/threonine kinase 2; serine/threonine protein kinase-2 (Homo sapiens)
3*      Sars90 (new prediction only in GZ01 strain)  GLLHRGT            NADH dehydrogenase I chain
4       Sars61 (new prediction)                      LLPLLAFL           Putative protein (conserved across two organisms)
5       Sars274 (Sars3a)                             LLLFVTIY           Polyamine transport protein; Tpo1p (Saccharomyces cerevisiae)
6       Sars154 (Sars3b)                             QTLVLKML           K550.3.p (Caenorhabditis elegans)
7       Sars63 (Sars6)                               DDEELMEL           Elongation factor Tu (Lactococcus lactis subsp. lactis)
8       Sars122 (Sars7a)                             LIVAALVF           Putative transport transmembrane protein (Sinorhizobium meliloti)
                                                     RARSVSPK           Src homology domain 3 (C. elegans)
9*      Sars78 (Sars9c)                              QLLAAVG            Gamma-glutamate kinase (conserved across 8 organisms)

*No conserved octapeptide was found. However, function has been assigned on the basis
of the only highly conserved heptapeptide.


suggested putative functions for nine other proteins,
including the ones newly predicted by GeneDecipher.